In this writeup I examine the following:

  1. How does sample size vary across datasets?

  2. How does the “exact” version of SCEPTRE (i.e., the version of SCEPTRE that returns an “exact” p-value based on B = 300,000 resamples) perform on the Frangieh IFN-gamma and Papalexi gene data?

A question that I seek to answer is whether SCEPTRE exhibits CAMP, or “Confounder Adjustment via Marginal Permutations” (previously known as “implicit confounder adjustment”).

Sample size

Sample size has emerged as an important quantity in low MOI (and even in high MOI). Here, I study issues related to sample size in a more precise and in-depth way than I have previously.

Definitions of relevant quantities

I begin by defining several quantities (n_nonzero_treatment, n_nonzero_control, and effective_sample_size) on the negative control data. Let a dataset be given. Let \(d\) be the number of NTCs, let \(p\) be the number of genes, and let \(n\) be the number of cells in the dataset. Let \(X \in \mathbb{R}^{n \times p}\) be the cell-by-gene expression matrix. Next, for \(k \in \{1, \dots, d\},\) let \(n_k\) be the number of cells containing NTC \(k\) (where the NTCs are arbitrarily indexed). Also, let \(X^{1} \in \mathbb{R}^{n_1 \times p}\) be the submatrix of \(X\) consisting of the cells that received NTC 1 (and similarly for \(X^2, \dots, X^d\)). For a given gene \(j\) and NTC \(k\), let $$\texttt{n_nonzero_treatment}_{jk}$$ be the number of cells containing NTC \(k\) in which gene \(j\) has nonzero expression, i.e.

\[ \texttt{n_nonzero_treatment}_{jk} = \sum_{i=1}^{n_k} \mathbb{I}(X^k_{i,j} \geq 1).\]

Next, let \(\texttt{n_nonzero_control}_{jk}\) be the number of cells containing an NTC other than \(k\) (i.e., the cells containing NTCs in the set \([d] \setminus \{k\}\)) in which gene \(j\) has nonzero expression, i.e.,

\[ \texttt{n_nonzero_control}_{jk} = \sum_{k' \neq k} \texttt{n_nonzero_treatment}_{k'j} \]

Finally, let effective_sample_size be the sum of n_nonzero_treatment and n_nonzero_control. On the negative control data, effective_sample_size is a function of the gene only (the dependence on \(k\) cancels when the two terms are added):

\[ \texttt{effective_sample_size}_j = \sum_{k = 1}^d \texttt{n_nonzero_treatment}_{jk} \]
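To make these definitions concrete, here is a small base-R sketch on toy data (not any of the actual datasets; the variable names are mine, not the pipeline's). The identity n_nonzero_treatment + n_nonzero_control = effective_sample_size holds by construction.

```r
# Toy cell-by-gene matrix: 10 cells, 3 genes; `ntc` assigns each cell an NTC.
set.seed(1)
X <- matrix(rpois(30, lambda = 0.5), nrow = 10, ncol = 3)
ntc <- rep(1:2, each = 5)  # d = 2 NTCs, 5 cells each

# n_nonzero_treatment[j, k]: cells containing NTC k in which gene j is nonzero
n_nonzero_treatment <- sapply(sort(unique(ntc)), function(k) {
  colSums(X[ntc == k, , drop = FALSE] >= 1)
})

# effective_sample_size[j]: sum over all NTCs (a function of the gene only)
effective_sample_size <- rowSums(n_nonzero_treatment)

# n_nonzero_control[j, k]: nonzero cells containing any NTC other than k
n_nonzero_control <- effective_sample_size - n_nonzero_treatment
```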

I study how effective_sample_size, n_nonzero_treatment, and n_nonzero_control relate to one another.

Empirical relationship between effective_sample_size, n_nonzero_treatment, and n_nonzero_control

First, I plot n_nonzero_treatment against effective_sample_size, faceting by dataset. In other words, I plot

\[ \{ (\texttt{n_nonzero_treatment}_{jk}, \texttt{effective_sample_size}_{j})\}_{j,k} \]

for each dataset.

Clearly, n_nonzero_treatment and effective_sample_size are correlated. This makes sense: for a given gene, if expression is high in cells with a given NTC, then expression likely is high in cells with the remaining NTCs. (Put differently, expression levels are correlated across NTCs: highly expressed genes tend to be highly expressed across NTCs, and similarly for lowly expressed genes.)

Next, I plot n_nonzero_control against effective_sample_size, i.e.

$$\{(\texttt{n_nonzero_control}_{jk}, \texttt{effective_sample_size}_{j})\}_{j,k}$$

Clearly, n_nonzero_control and effective_sample_size are essentially the same variable. Again, this makes sense: for a given NTC, n_nonzero_control is equal to effective_sample_size minus the number of cells containing that NTC with nonzero expression. Finally, I plot n_nonzero_treatment against n_nonzero_control.

Again, these variables are correlated (but certainly not perfectly) for the same reason that n_nonzero_treatment and effective_sample_size are correlated.

I conclude on the basis of this analysis that, while effective_sample_size is a useful single-number summary of sample size, it is worthwhile to examine both n_nonzero_treatment and n_nonzero_control when investigating sample size-related issues.

Comparison across datasets

Here, I make some comparisons related to sample size across datasets. First, I make a barplot of the number of NTCs per dataset.

The Frangieh data contain the greatest number of NTCs, followed by the Schraivogel data and then the Papalexi data. Next, I plot the average number of nonzero cells per NTC (i.e., n_nonzero_treatment) across datasets.

I print the median n_nonzero_treatment value for each dataset below.

sample_size_df_nt |>
  group_by(dataset) |>
  summarize(m = median(n_nonzero_treatment))
## # A tibble: 7 × 2
##   dataset                                      m
##   <fct>                                    <dbl>
## 1 frangieh_co_culture_gene                   9  
## 2 frangieh_control_gene                      6  
## 3 frangieh_ifn_gamma_gene                    7  
## 4 papalexi_eccite_screen_gene               29  
## 5 schraivogel_enhancer_screen               52.5
## 6 schraivogel_ground_truth_perturbseq_gene  37  
## 7 schraivogel_ground_truth_tapseq_gene     165

The Frangieh data have the smallest number of nonzero cells per NTC (median under 10). The Papalexi data are intermediate (median about 30), and the Schraivogel data have the greatest number of nonzero cells per NTC (median 37 on the perturb-seq data; even greater medians on the other datasets). Finally, I plot the effective_sample_size per dataset.

I print the per-dataset median effective sample size below.

sample_size_df_nt |>
  group_by(dataset) |>
  summarize(m = median(effective_samp_size))
## # A tibble: 7 × 2
##   dataset                                      m
##   <fct>                                    <dbl>
## 1 frangieh_co_culture_gene                   712
## 2 frangieh_control_gene                      436
## 3 frangieh_ifn_gamma_gene                    581
## 4 papalexi_eccite_screen_gene                315
## 5 schraivogel_enhancer_screen               1764
## 6 schraivogel_ground_truth_perturbseq_gene  1186
## 7 schraivogel_ground_truth_tapseq_gene      5775

We see that the effective sample sizes are in a similar range on the Frangieh and Papalexi datasets (while the effective sample sizes on the Schraivogel datasets are larger). The Frangieh data have fewer nonzero cells per NTC but many more NTCs than the Papalexi data, causing the effective sample size to be roughly equal across the Frangieh and Papalexi data.

Now, I examine the number of nonzero cells per targeting gRNA (grouped and ungrouped) across datasets.

sample_size_df_t |>
  group_by(dataset) |>
  summarize(median_nonzero = median(n_nonzero_cells))
## # A tibble: 7 × 2
##   dataset                                  median_nonzero
##   <fct>                                             <dbl>
## 1 frangieh_co_culture_gene                              5
## 2 frangieh_control_gene                                 3
## 3 frangieh_ifn_gamma_gene                               4
## 4 papalexi_eccite_screen_gene                          17
## 5 schraivogel_enhancer_screen                          14
## 6 schraivogel_ground_truth_perturbseq_gene             26
## 7 schraivogel_ground_truth_tapseq_gene                117

Next, I plot the number of nonzero cells per gRNA group:


Number of hypotheses

First, I plot the number of genes per dataset (after applying mild QC: filtering out genes expressed in fewer than 0.5% of cells).
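As a concrete sketch of this gene-level QC step (toy data; the 0.5% threshold is the one stated above, but the matrix and variable names here are hypothetical):

```r
# Keep genes with nonzero expression in at least 0.5% of cells.
set.seed(2)
X <- matrix(rbinom(200 * 10, size = 1, prob = 0.05), nrow = 200, ncol = 10)
frac_expressed <- colMeans(X >= 1)  # per-gene fraction of nonzero cells
X_qc <- X[, frac_expressed >= 0.005, drop = FALSE]
```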

All datasets have roughly the same number of genes (about 15,000) except for the TAP-seq and enhancer screen datasets (which of course have fewer genes). Next, I plot the number of pairs across datasets.

The Frangieh data of course contain more pairs because they contain more NT gRNAs.

Exact SCEPTRE

I applied the exact version of SCEPTRE (with covariates and using B = 300,000 permutations) to the Papalexi data and Frangieh data.
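For reference, an "exact" resampling p-value typically takes the standard finite-sample-valid form \(p = (1 + \#\{b : |T_b| \geq |T_{\text{obs}}|\}) / (B + 1)\); I sketch that formula below on toy data (this is the textbook construction, and I am assuming, not asserting, that SCEPTRE's exact version uses it).

```r
# Finite-sample-valid permutation p-value:
# p = (1 + #{b : |T_b| >= |T_obs|}) / (B + 1).
perm_p_value <- function(t_obs, t_null) {
  (1 + sum(abs(t_null) >= abs(t_obs))) / (length(t_null) + 1)
}

set.seed(3)
t_null <- rnorm(1000)  # toy null statistics (B = 1,000 here, not 300,000)
perm_p_value(2.5, t_null)
```

The +1 in the numerator and denominator guarantees validity for any B, at the cost of a smallest attainable p-value of 1/(B + 1); B = 300,000 thus gives a resolution of roughly 3.3e-6.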

First, I plot the SCEPTRE exact results on the Frangieh data, comparing to (i) the skew-normal version of SCEPTRE (with covariates), i.e., SCEPTRE (SN), (ii) NB regression, and (iii) Seurat DE. I plot the results twice: once without QC (top) and once with mild QC (bottom). (“Mild QC” entails filtering for pairs with at least 10 treatment cells with nonzero expression and 10 control cells with nonzero expression. See the next writeup (Writeup 8) for a justification of these thresholds.)

SCEPTRE (exact) exhibits good calibration regardless of whether we apply QC. SCEPTRE (SN) and Seurat DE perform similarly to one another on both the QC’ed and non-QC’ed data. Furthermore, both methods perform better on the QC’ed data than on the non-QC’ed data, likely because the CLT-based approximation that they employ is more accurate on the QC’ed data. Finally, NB regression does not seem to improve much (or at all) as we apply more stringent QC. This might be due to model misspecification for some fraction of the hypotheses tested. The interaction between QC level and method performance is fairly striking (and perhaps expected, given the extent to which these methods rely on the CLT).

Next, I create the same set of plots for the Papalexi data. I plot SCEPTRE (exact), SCEPTRE (SN), Seurat DE, and NB regression. The top plot shows the results on the un-QC’ed data, and the bottom plot shows the results on the QC’ed data (where QC entails filtering for pairs with more than 7 treatment cells with nonzero expression and more than 30 control cells with nonzero expression).

I compute the fraction of negative control pairs removed after applying QC. Note that the fraction of pairs filtered on the negative control data should be much greater than that on the rest of the data, as the negative control data have one fewer NTC in the control group and only a single gRNA in the undercover group. (In other words, the sample sizes are generally smaller on the negative control data than on the rest of the data.)

res |>
  filter(dataset %in% c("frangieh_co_culture_gene",
                        "papalexi_eccite_screen_gene")) |>
  group_by(dataset) |>
  summarize(frac_removed = 1 - mean(n_nonzero_treatment >= n_nonzero_treatment_thresh &
                                      n_nonzero_control >= n_nonzero_control_thresh))
## # A tibble: 2 × 2
##   dataset                     frac_removed
##   <chr>                              <dbl>
## 1 frangieh_co_culture_gene           0.508
## 2 papalexi_eccite_screen_gene        0.289

Interestingly, SCEPTRE (exact) is miscalibrated on the un-QC’ed data but is excellently calibrated on the QC’ed data. I conjecture that SCEPTRE (exact) is miscalibrated on the un-QC’ed data for the following reason: some of the pairs are confounded, and the sample size is too small for CAMP to have kicked in. On the QC’ed data, meanwhile, CAMP has kicked in, enabling SCEPTRE to return a well-calibrated p-value for the confounded pairs. In fact, on the QC’ed data, SCEPTRE (exact) exhibits the best calibration of all the methods, even better than NB regression. This possibly is because NB regression is less robust than SCEPTRE to GLM model misspecification.

Two questions arise: first, why is SCEPTRE (SN) better calibrated than SCEPTRE (exact) on the non-QC’ed data? And second, why is NB regression better calibrated than SCEPTRE on the non-QC’ed data? I attempt to address the first question here. Below, I plot the SCEPTRE (exact) p-values against the SCEPTRE (SN) p-values on a transformed scale.

The SCEPTRE (SN) p-values are in broad agreement with the SCEPTRE (exact) p-values; however, the scatterplot has an obvious “appendage” below the diagonal. The pairs in this “appendage” have much smaller SCEPTRE (exact) p-values than SCEPTRE (SN) p-values. Nearly all the pairs in this appendage are filtered out by QC. Below, I create a similar plot to compare the SCEPTRE p-values to the negative binomial p-values.

These results are even more striking than the results comparing SCEPTRE (exact) to SCEPTRE (SN). There exists a set of pairs for which the NB regression p-value is about 1 and the SCEPTRE (exact) p-value is in the interval (0, 1). (These are the points at the bottom of the plot.) We see that very few of these pairs pass QC. What is the deal with these pairs? I filter for pairs with NB p-values greater than 0.99 and SCEPTRE (exact) p-values less than 0.5. (This is the row of points at the bottom of the plot to the right of the vertical black line.)

comparison_df |>
  filter(p_value_nb > 0.99,
         p_value_exact < 0.5) |>
  summarize(m = mean(n_nonzero_treatment == 0)) |>
  pull() * 100
## [1] 99.62238

Nearly all of these points (99.6%) have zero treatment cells with expression. We thus have an (at least proximate) explanation for the discrepancy between SCEPTRE (exact) and NB regression. When no treatment cell contains nonzero gene expression, SCEPTRE (exact) and NB regression both fail. SCEPTRE (exact) fails in a liberal direction, producing inflated p-values. NB regression, by contrast, fails in a conservative direction, yielding p-values equal to one. We likely will filter out such pairs as part of QC. Thus, on the pairs that we will subject to analysis, NB regression and SCEPTRE (exact) coincide fairly closely. (And in fact, SCEPTRE (exact) exhibits slightly better calibration on the QC’ed pairs, likely due to greater robustness to model misspecification.) The results are encouraging. After applying very mild QC, SCEPTRE emerges as the best method on both the Papalexi and Frangieh data. This is due to CAMP.

Comparing the four versions of SCEPTRE

Here, I plot the four versions of SCEPTRE (with covariates vs. without covariates, skew-normal vs. exact) on the Papalexi and Frangieh data. I apply the mild QC used above. First, I examine the Frangieh IFN-gamma data.

Including or excluding covariates does not seem to have much of an effect (at least on type-I error control). The skew-normal approximation, by contrast, leads to inflation. Next, I examine the Papalexi gene expression data.

The results are fascinating: SCEPTRE (skew-normal, without covariates) performs worst; SCEPTRE (exact, without covariates) performs second worst; SCEPTRE (skew-normal, with covariates) performs second best; and SCEPTRE (exact, with covariates) performs best. This result provides clear evidence that the skew-normal fit and confounding both pose challenges to calibration, with confounding being the more serious problem. I plot the results above on an untransformed scale.

In the bulk of the p-value distribution, the skew-normal fit causes more problems than confounding does.